<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic
  PUBLIC "-//OASIS//DTD DITA Composite//EN" "ditabase.dtd">
<topic id="topic1">
  <title id="bd20941">ANALYZE</title>
  <body>
    <p id="sql_command_desc">Collects statistics about a database.</p>
    <section id="section2">
      <title>Synopsis</title>
      <codeblock id="sql_command_synopsis">ANALYZE [VERBOSE] [<varname>table</varname> [ (<varname>column</varname> [, ...] ) ]]

ANALYZE [VERBOSE] {<varname>root_partition_table_name</varname>|<varname>leaf_partition_table_name</varname>} [ (<varname>column</varname> [, ...] )] 

ANALYZE [VERBOSE] ROOTPARTITION {ALL | <varname>root_partition_table_name</varname> [ (<varname>column</varname> [, ...] )]}</codeblock>
    </section>
    <section id="section3">
      <title>Description</title>
      <p><codeph>ANALYZE</codeph> collects statistics about the contents of tables in the database,
        and stores the results in the system table <i>pg_statistic</i>. Subsequently, Greenplum
        Database uses these statistics to help determine the most efficient execution plans for
        queries. For information about the table statistics that are collected, see <xref
          href="#topic1/section5" format="dita">Notes</xref>.</p>
      <p>With no parameter, <codeph>ANALYZE</codeph> collects statistics for every table in the
        current database. You can specify a table name to collect statistics for a single table. You
        can specify a set of column names in a specific table, in which case the statistics only for
        those columns from that table are collected.</p>
      <p><codeph>ANALYZE</codeph> does not collect statistics on external tables. </p>
      <p>For partitioned tables, <codeph>ANALYZE</codeph> collects additional statistics,
        HyperLogLog (HLL) statistics, on the leaf child partitions. HLL statistics are used are used
        to derive number of distinct values (NDV) for queries against partitioned tables.<ul
          id="ul_nxx_p3p_x2b">
          <li>When aggregating NDV estimates across multiple leaf child partitions, HLL statistics
            generate a more accurate NDV estimates than the standard table statistics. </li>
          <li>When updating HLL statistics, <codeph>ANALYZE</codeph> operations are required only on
            leaf child partitions that have changed. For example, <codeph>ANALYZE</codeph> is
            required if the leaf child partition data has changed, or if the leaf child partition
            has been exchanged with another table. For more information about updating partitioned
            table statistics, see <xref href="#topic1/section5" format="dita">Notes</xref>.</li>
        </ul></p>
      <note type="important">If you intend to run queries on partitioned tables with GPORCA
        enabled (the default), then you must collect statistics on the root partition of the
        partitioned table with the <codeph>ANALYZE</codeph> or <codeph>ANALYZE
          ROOTPARTITION</codeph> command. For information about collecting statistics on partitioned
        tables and when the <codeph>ROOTPARTITION</codeph> keyword is required, see <xref
          href="#topic1/section5" format="dita">Notes</xref>. For information about GPORCA, see
          <xref href="../../admin_guide/query/topics/query-piv-opt-overview.html" format="html" scope="external">Overview of GPORCA</xref><ph
          otherprops="op-print"> in the <cite>Greenplum Database Administrator Guide</cite></ph>. </note>
      <note>You can also use the Greenplum Database utility <codeph>analyzedb</codeph> to update
        table statistics. The <codeph>analyzedb</codeph> utility can update statistics for multiple
        tables concurrently. The utility can also check table statistics and update statistics only
        if the statistics are not current or do not exist. For information about the utility, see
        the <cite>Greenplum Database Utility Guide</cite>.</note>
    </section>
    <section id="section4">
      <title>Parameters</title>
      <parml>
        <plentry>
          <pt>{ <varname>root_partition_table_name</varname> | <varname>leaf_partition_table_name</varname> } [
              (<varname>column</varname> [, ...] ) ]</pt>
          <pd>Collect statistics for partitioned tables including HLL statistics. HLL statistics are
            collected only on leaf child partitions. </pd>
          <pd><codeph>ANALYZE <varname>root_partition_table_name</varname></codeph>, collects statistics on all
            leaf child partitions and the root partition. </pd>
          <pd><codeph>ANALYZE <varname>leaf_partition_table_name</varname></codeph>, collects statistics on the
            leaf child partition. </pd>
          <pd>By default, if you specify a leaf child partition, and all other leaf child partitions
            have statistics, <codeph>ANALYZE</codeph> updates the root partition statistics. If not
            all leaf child partitions have statistics, <codeph>ANALYZE</codeph> logs information
            about the leaf child partitions that do not have statistics. For information about when
            root partition statistics are collected, see <xref href="#topic1/section5" format="dita"
              >Notes</xref>.</pd>
        </plentry>
        <plentry>
          <pt>ROOTPARTITION [ALL]</pt>
          <pd>Collect statistics only on the root partition of partitioned tables based on the data
            in the partitioned table. If possible, <codeph>ANALYZE</codeph> uses leaf child
            partition statistics to generate root partition statistics. Otherwise,
              <codeph>ANALYZE</codeph> collects the statistics by sampling leaf child partition
            data. Statistics are not collected on the leaf child partitions, the data is only
            sampled. HLL statistics are not collected. </pd>
          <pd>For information about when the <codeph>ROOTPARTITION</codeph> keyword is required, see
              <xref href="#topic1/section5" format="dita">Notes</xref>.</pd>
          <pd>When you specify <codeph>ROOTPARTITION</codeph>, you must specify either
              <codeph>ALL</codeph> or the name of a partitioned table. </pd>
          <pd>If you specify <codeph>ALL</codeph> with <codeph>ROOTPARTITION</codeph>, Greenplum
            Database collects statistics for the root partition of all partitioned tables in the
            database. If there are no partitioned tables in the database, a message stating that
            there are no partitioned tables is returned. For tables that are not partitioned tables,
            statistics are not collected. </pd>
          <pd>If you specify a table name with <codeph>ROOTPARTITION</codeph> and the table is not a
            partitioned table, no statistics are collected for the table and a warning message is
            returned.</pd>
          <pd>The <codeph>ROOTPARTITION</codeph> clause is not valid with <codeph>VACUUM
              ANALYZE</codeph>. The command <codeph>VACUUM ANALYZE ROOTPARTITION</codeph> returns an
            error. </pd>
          <pd>The time to run <codeph>ANALYZE ROOTPARTITION</codeph> is similar to the time to
            analyze a non-partitioned table with the same data since <codeph>ANALYZE
              ROOTPARTITION</codeph> only samples the leaf child partition data. </pd>
          <pd>For the partitioned table <i>sales_curr_yr</i>, this example command collects
            statistics only on the root partition of the partitioned table. <codeph>ANALYZE
              ROOTPARTITION sales_curr_yr; </codeph></pd>
          <pd>This example <codeph>ANALYZE</codeph> command collects statistics on the root
            partition of all the partitioned tables in the database.
            <codeblock>ANALYZE ROOTPARTITION ALL;</codeblock></pd>
        </plentry>
        <plentry>
          <pt>VERBOSE</pt>
          <pd>Enables display of progress messages. When specified, <codeph>ANALYZE</codeph> emits
            this information <ul id="ul_phq_1n2_jr">
              <li>The table that is being processed.</li>
              <li>The query that is run to generate the sample table.</li>
              <li>The column for which statistics is being computed.</li>
              <li>The queries that are issued to collect the different statistics for a single
                column.</li>
              <li>The statistics that are collected.</li>
            </ul></pd>
        </plentry>
        <plentry>
          <pt>
            <varname>table</varname>
          </pt>
          <pd>The name (possibly schema-qualified) of a specific table to analyze. If omitted, all regular tables (but not foreign tables)
            in the current database are analyzed. </pd>
        </plentry>
        <plentry>
          <pt>
            <varname>column</varname>
          </pt>
          <pd>The name of a specific column to analyze. Defaults to all columns. </pd>
        </plentry>
      </parml>
    </section>
    <section id="section5">
      <title>Notes</title>
      <p>Foreign tables are analyzed only when explicitly selected. Not all foreign data wrappers
        support <codeph>ANALYZE</codeph>. If the table's wrapper does not support
          <codeph>ANALYZE</codeph>, the command prints a warning and does nothing.</p>
      <p>It is a good idea to run <codeph>ANALYZE</codeph> periodically, or just after making major
        changes in the contents of a table. Accurate statistics helps Greenplum Database choose the
        most appropriate query plan, and thereby improve the speed of query processing. A common
        strategy for read-mostly databases is to run <codeph><xref href="./VACUUM.xml#topic1"
            type="topic" format="dita"/></codeph> and <codeph>ANALYZE</codeph> once a day during a
        low-usage time of day. (This will not be sufficient if there is heavy update activity.) You
        can check for tables with missing statistics using the <codeph>gp_stats_missing</codeph>
        view, which is in the <codeph>gp_toolkit</codeph> schema:</p>
      <codeblock>SELECT * from gp_toolkit.gp_stats_missing;</codeblock>
      <p><codeph>ANALYZE</codeph> requires <codeph>SHARE UPDATE EXCLUSIVE</codeph> lock on the
        target table. This lock conflicts with these locks: <codeph>SHARE UPDATE EXCLUSIVE</codeph>,
          <codeph>SHARE</codeph>, <codeph>SHARE ROW EXCLUSIVE</codeph>, <codeph>EXCLUSIVE</codeph>,
          <codeph>ACCESS EXCLUSIVE</codeph>.</p>
      <p>If you run <codeph>ANALYZE</codeph> on a table that does not contain data, statistics are
        not collected for the table. For example, if you perform a <codeph>TRUNCATE</codeph>
        operation on a table that has statistics, and then run <codeph>ANALYZE</codeph> on the
        table, the statistics do not change.</p>
      <p>For a partitioned table, specifying which portion of the table to analyze, the root
        partition or subpartitions (leaf child partition tables) can be useful if the partitioned
        table has a large number of partitions that have been analyzed and only a few leaf child
        partitions have changed. <note>When you create a partitioned table with the <codeph>CREATE
            TABLE</codeph> command, Greenplum Database creates the table that you specify (the root
          partition or parent table), and also creates a hierarchy of tables based on the partition
          hierarchy that you specified (the child tables). </note></p>
      <ul>
        <li id="bd138200">When you run <codeph>ANALYZE</codeph> on the root partitioned table,
          statistics are collected for all the leaf child partitions. Leaf child partitions are the
          lowest-level tables in the hierarchy of child tables created by Greenplum Database for use
          by the partitioned table. </li>
        <li id="bd138204">When you run <codeph>ANALYZE</codeph> on a leaf child partition,
          statistics are collected only for that leaf child partition and the root partition. If
          data in the leaf partition has changed (for example, you made significant updates to the
          leaf child partition data or you exchanged the leaf child partition), then you can run
          ANALYZE on the leaf child partition to collect table statistics. By default, if all other
          leaf child partitions have statistics, the command updates the root partition statistics.
            <p>For example, if you collected statistics on a partitioned table with a large number
            partitions and then updated data in only a few leaf child partitions, you can run
              <codeph>ANALYZE</codeph> only on those partitions to update statistics on the
            partitions and the statistics on the root partition.</p></li>
        <li>When you run <codeph>ANALYZE</codeph> on a child table that is not a leaf child
          partition, statistics are not collected. <p>For example, you can create a partitioned
            table with partitions for the years 2006 to 2016 and subpartitions for each month in
            each year. If you run <codeph>ANALYZE</codeph> on the child table for the year 2013 no
            statistics are collected. If you run <codeph>ANALYZE</codeph> on the leaf child
            partition for March of 2013, statistics are collected only for that leaf child
            partition. </p></li>
      </ul>
      <p>For a partitioned table that contains a leaf child partition that has been exchanged to use
        an external table, <codeph>ANALYZE</codeph> does not collect statistics for the external
        table partition:</p>
      <ul id="ol_x5n_2ff_ss">
        <li>If <codeph>ANALYZE</codeph> is run on an external table partition, the partition is not
          analyzed.</li>
        <li>If <codeph>ANALYZE</codeph> or <codeph>ANALYZE ROOTPARTITION</codeph> is run on the root
          partition, external table partitions are not sampled and root table statistics do not
          include external table partition. </li>
        <li>If the <codeph>VERBOSE</codeph> clause is specified, an informational message is
          displayed: <codeph>skipping external table</codeph>.</li>
      </ul>
      <p>The Greenplum Database server configuration parameter <codeph><xref
            href="../config_params/guc-list.xml#optimizer_analyze_root_partition"
            >optimizer_analyze_root_partition</xref></codeph> affects when statistics are collected
        on the root partition of a partitioned table. If the parameter is <codeph>on</codeph> (the
        default), the <codeph>ROOTPARTITION</codeph> keyword is not required to collect statistics
        on the root partition when you run <codeph>ANALYZE</codeph>. Root partition statistics are
        collected when you run <codeph>ANALYZE</codeph> on the root partition, or when you run
          <codeph>ANALYZE</codeph> on a child leaf partition of the partitioned table and the other
        child leaf partitions have statistics. If the parameter is <codeph>off</codeph>, you must
        run <codeph>ANALZYE ROOTPARTITION</codeph> to collect root partition statistics.</p>
      <p>The statistics collected by <codeph>ANALYZE</codeph> usually include a list of some of the
        most common values in each column and a histogram showing the approximate data distribution
        in each column. One or both of these may be omitted if <codeph>ANALYZE</codeph> deems them
        uninteresting (for example, in a unique-key column, there are no common values) or if the
        column data type does not support the appropriate operators.</p>
      <p>For large tables, <codeph>ANALYZE</codeph> takes a random sample of the table contents,
        rather than examining every row. This allows even very large tables to be analyzed in a
        small amount of time. Note, however, that the statistics are only approximate, and will
        change slightly each time <codeph>ANALYZE</codeph> is run, even if the actual table contents
        did not change. This may result in small changes in the planner's estimated costs shown by
          <codeph>EXPLAIN</codeph>. In rare situations, this non-determinism will cause the query
        optimizer to choose a different query plan between runs of <codeph>ANALYZE</codeph>. To
        avoid this, raise the amount of statistics collected by <codeph>ANALYZE</codeph> by
        adjusting the <varname>default_statistics_target</varname> configuration parameter, or on a
        column-by-column basis by setting the per-column statistics target with <codeph>ALTER TABLE
          ... ALTER COLUMN ... SET (n_distinct ...)</codeph> (see <codeph>ALTER TABLE</codeph>). The
        target value sets the maximum number of entries in the most-common-value list and the
        maximum number of bins in the histogram. The default target value is 100, but this can be
        adjusted up or down to trade off accuracy of planner estimates against the time taken for
          <codeph>ANALYZE</codeph> and the amount of space occupied in
        <codeph>pg_statistic</codeph>. In particular, setting the statistics target to zero disables
        collection of statistics for that column. It may be useful to do that for columns that are
        never used as part of the <codeph>WHERE</codeph>, <codeph>GROUP BY</codeph>, or
          <codeph>ORDER BY</codeph> clauses of queries, since the planner will have no use for
        statistics on such columns. </p>
      <p>The largest statistics target among the columns being analyzed determines the number of
        table rows sampled to prepare the statistics. Increasing the target causes a proportional
        increase in the time and space needed to do <codeph>ANALYZE</codeph>.</p>
      <p>One of the values estimated by <codeph>ANALYZE</codeph> is the number of distinct values
        that appear in each column. Because only a subset of the rows are examined, this estimate
        can sometimes be quite inaccurate, even with the largest possible statistics target. If this
        inaccuracy leads to bad query plans, a more accurate value can be determined manually and
        then installed with <codeph>ALTER TABLE ... ALTER COLUMN ... SET STATISTICS
          DISTINCT</codeph> (see <xref href="ALTER_TABLE.xml#topic1"/>).</p>
      <p>When Greenplum Database performs an <codeph>ANALYZE</codeph> operation to collect
        statistics for a table and detects that all the sampled table data pages are empty (do not
        contain valid data), Greenplum Database displays a message that a <codeph>VACUUM
          FULL</codeph> operation should be performed. If the sampled pages are empty, the table
        statistics will be inaccurate. Pages become empty after a large number of changes to the
        table, for example deleting a large number of rows. A <codeph>VACUUM FULL</codeph> operation
        removes the empty pages and allows an <codeph>ANALYZE</codeph> operation to collect accurate
        statistics. </p>
      <p>If there are no statistics for the table, the server configuration parameter <codeph><xref
            href="../config_params/guc-list.xml#gp_enable_relsize_collection"/></codeph> controls
        whether the Postgres Planner uses a default statistics file or estimates the size of a
        table using the <codeph>pg_relation_size</codeph> function. By default, the Postgres Planner
        uses the default statistics file to estimate the number of rows if statistics are not
        available.</p>
    </section>
    <section id="section6">
      <title>Examples</title>
      <p>Collect statistics for the table <codeph>mytable</codeph>:</p>
      <codeblock>ANALYZE mytable;</codeblock>
    </section>
    <section id="section7">
      <title>Compatibility</title>
      <p>There is no <codeph>ANALYZE</codeph> statement in the SQL standard. </p>
    </section>
    <section id="section8">
      <title>See Also</title>
      <p><codeph><xref href="ALTER_TABLE.xml#topic1" type="topic" format="dita"/></codeph>,
            <codeph><xref href="./EXPLAIN.xml#topic1" type="topic" format="dita"/></codeph>,
            <codeph><xref href="./VACUUM.xml#topic1" type="topic" format="dita"/></codeph>, 
        <codeph><xref href="../../utility_guide/ref/analyzedb.xml#topic1" type="topic" format="dita"/></codeph>.</p>
    </section>
  </body>
</topic>
